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Abstract 

The Joint Optimization of Fidelity and Commensurability (JOFC) manifold match¬ 
ing methodology embeds an omnibus dissimilarity matrix consisting of multiple dis¬ 
similarities on the same set of objects. One approach to this embedding optimizes 
the preservation of fidelity to each individual dissimilarity matrix together with com¬ 
mensurability of each given observation across modalities via iterative majorization 
of a raw stress error criterion by successive Guttman transforms. In this paper, we 
exploit the special structure inherent to JOFC to exactly and efficiently compute the 
successive Guttman transforms, and as a result we are able to greatly speed up the 
JOFC procedure for both in-sample and out-of-sample embedding. We demonstrate 
the scalability of our implementation on both real and simulated data examples. 

Keywords: Multidimensional Scaling, manifold matching, distance geometry, distance ma¬ 
trices 


*The authors gratefully acknowledge support from the XDATA program of the Defense Advanced Re¬ 
search Projects Agency (DARPA) administered through Air Force Research Laboratory contract FA8750- 
12-2-0303, and the NSF BRAIN Early Concept Grants for Exploratory Research (EAGER) award DBI- 
1451081. 


1 



1 Introduction and Background 


Manifold matching—embedding mnltiple modality data sets into a common low-dimensional 
space wherein joint inference can be investigated—is an important inference task in sta¬ 


tistical pattern recognition, with applications in computer vision (see, for example, Nastar 


et ah 

1996 

Hardoon et ah, 2004 Elgammal and Lee, |2004 ; |Wang and Suter, 2007 

Ham 

et ah 

2003 

, text and language processing (see, for example. 

Karakos et ah 

2007 

Vinok- 


ourov et ah, 2002 Sahami and Heilman, 2006||), and machine learning (see, for example. 


Wang and Mahadevan, 2008, 2009 Lafon et ah, 2006 Ham et ah, 2005), to name a few; 


for a survey of the literature on manifold matching and the broader problem of transfer 


learning, see Pan and Yang (2010). 

In the present manifold matching framework, we consider n objects, each measured 
under m disparate modalities or conditions, each modality yielding an object-wise dissim¬ 
ilarity matrix thus Ai, A 2 ,..., A^ € The Joint Optimization of Fidelity 


and Commensurability (JOFC) algorithm of Priebe et ah (2013) is a manifold matching 
procedure that simultaneously embeds these mn data points {n objects in m modalities) 
into a common Euclidean space by embedding an omnibus dissimilarity matrix A which 
encapsulates the information contained in the dissimilarities {Aj}™^. The JOFC algorithm 
has proven to be a flexible and effective manifold matching algorithm, with numerous ap- 


plications and extensions in the literature; see 

Ma et al. ( 

2012 

); 

Sun and Priebe 

(2013); 

Lyzinski et al. 

(2013) 

Adah and Priebe 

(2015) 

Shen et al. 

(2016 

). One approach to this 


embedding optimizes the preservation of hdelity to each individual dissimilarity matrix 
(i.e., preserving the within modality dissimilarities) together with the commensurability of 
the observations across modalities (i.e., preserving the cross-modality matchedness of the 
data). This approach embeds A by minimizing Kruskal’s raw stress criterion for metric 


multidimensional scaling (MDS) via successive Guttman transforms (Borg and Groenen 
2005); see Algorithm 


In this paper, we exploit the special structure of the JOFC weight matrix to exactly 
and efficiently compute these successive Guttman transforms. Employing this and further 
computational simplifications, we are able to dramatically speed the JOFC procedure (see 
Algorithm]^ and extend this speedup to out-of-sample embedding for JOFC. In addition. 
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Notation 

Description 

Reference 

Jn 

The n X n matrix of all I’s 

Used throughout 

In 

The n X n identity matrix 

Used throughout 

X 

Final conhguration obtained via 3-RSMDS JOFC and fJOFC 

Sec. 1.2, 1.2.] 

. anc 

2 

d(X) 

The raw stress objective fen. of 3-RSMDS 

Eq. ( 

1 


A 

The omnibus dissimilarity embedded by 3-RSMDS 

Eq. ( 

2 


a(X) 

The raw stress objective fen. of JOFC and fJOFC 

Eq. ( 

3 


A 

The omnibus dissimilarity embedded by JOFC and fJOFC 

Eq. ( 

5 


W 

The weight matrix used in the JOFC and fJOFC embeddings 

Eq. ( 

6 


L 

The combinatorial Laplacian of W 

Sec. 

1.2.1 

and 2 


R(X) 

The R-matrix used in the JOFC Guttman transform updates 

Eq. ( 

7 


V 

The Moore-Penrose pseudoinverse of L 

Sec. 

1.2.1 

and 2 


w 

A modihed weight matrix used in the computation of 

Eq. (: 

10 

) 

a(°) 

The out-of-sample omnibus dissimilarity embedded by fJOFC 

Sec. 

3 


t^x(y) 

The out-of-sample raw stress criterion 

Sec. 

3 


W(o) 

The out-of-sample weight matrix used in the fJOFC embedding 

Sec. 

3 


LF) 

The combinatorial Laplacian of WF) 

Sec. 

3 



Table 1: Table of relevant notation. 


parallelizing the resulting algorithm—see Remark [8]— is immediate. We demonstrate these 
speedups and the utility of the JOFC framework in real and synthetic data examples. 
Notation: To aid the reader, we have collected the frequently used notation introduced 
in this manuscript into a table for ease of reference; see Table [Tj 

1.1 JOFC and Three-Way Raw Stress MDS 

In the JOFC framework, we use Raw Stress MDS to simultaneously embed the m object- 
wise dissimilarity matrices Ai, A 2 ,..., Am € while preserving both the matchedness 
of the objects across modality and the within modality dissimilarities. In this way, JOFC 
is closely related to Three-Way Raw Stress MDS (3-RSMDS). The key difference is that 
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the cross modality matchedness of the objects in 3-RSMDS is enforced via a constraint on 
the feasible region, while in JOFC the matchedness is enforced by adding a snitable term 
into the raw stress criterion. In that light, JOFC can be viewed as a softly constrained 
version of 3-RSMDS. We highlight the commonalities and differences between the two 
approaches below, and in Section |4.1 empirically compare their respective performances 
in an illustrative simulation. For further discussion of the connection between JOFC and 


Three-Way Nonmetric MDS in the context of hypothesis testing, see Castle (2012), Chapter 


Remark 1. While the JOFC algorithm is closely related to 3-RSMDS, it bears mentioning 
the relationship of the algorithm to other existing manifold alignment procedures. Many 
existing algorithms begin with a set of high-dimensional points sampled or observed from 
manifolds in see, for example. Ham et ah (2005); Wang and Suter (2007); Sharma et ah 


(2012). Dimension reduction techniques are then applied jointly to the observations to 
align the manifolds in a common d-dimensional embedding space with d <^k. In JOFC— 


similar to many of MDS and kernel based methods; see, for example, Leeuw and Mair 


(2008); Wang and Mahadevan (2008); Shen et ah (2016)—often the objects’ measurements 
cannot be made in Euclidean space. For example, the views of a single object may represent 
the i. text content, ii. images, iii. communication activity associated with a single social 
media prohle. While these data are non-Euclidean by nature, nonetheless there are well 
established dissimilarities that can be computed within each modality. Indeed, the only 
requirement in the JOFC framework is that we can compute dissimilarities amongst the 
data points within each modality. 


1.2 Three-Way Raw Stress MDS 

In both the 3-RSMDS and the JOFC frameworks, we seek to simultaneously embed the m 
object-wise dissimilarity matrices, and in both regimes, the m dissimilarities are measured 
between the same n objects; i.e., they are produced by repeated measurements or obser¬ 
vations under potentially disparate modalities. Assuming that the entire cross-modality 
correspondence is known a priori between the n objects, Three-Way Raw Stress Multidi- 
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mensional Scaling (3-RSMDS) seeks to find a configuration 


x^= [(x(^y|(x(2))^| ...e 

of the mn points that minimizes the raw stress criterion, 


2=1 j<k 


QTnnxd 


( 1 ) 


subject to the constraint that X^ = for all i G [m] := {1,2,... ,m} (note that to 

remove nonidentihability issues, G is often constrained to satisfy GG^ = /„). In ([^, for 
M G (ijj(M) is the Euclidean distance between the i-th. and j-th rows of M, and for 

i G [n], X*^*) are the embedded points in corresponding to Aj. Adopting the terminology 
in Borg and Groenen (2005), in the dimension-weighting 3-RSMDS model, G is known as 


the group stimulus space, and the Wh) are diagonal matrices with nonnegative diagonal 
entries. In this model, the individual embeddings X*^®^ differ only in the (potentially dif¬ 
ferent) weights— given by the diagonal entries of the respective W^*^’s—they place on the 
dimensions of G. 

The 3-RSMDS dimension weighting model and its variants have been well-studied in 


the literature; see, for example. 

Carroll and Changf 

1970) 

Carroll and Wish 

(1974) 

Schulz 

(1980) 

De Leeuw and Reiser 

(1980) 

Reiser 

(1988) 

Rarshman and Lundy ( 

1984 

). Indeed, 


there are a number of proposed procedures in the literature for solving the Three-Way MDS 
problem under a variety of error criterion, including the INDSCAL algorithm of |Carroll and 
Chang (1970); the IDIOSCAL algorithm of Carroll and Wish ( |1974 ); Schulz (1980); the 
PROXSCAL algorithm of Reiser (1988); and the PARAFAC algorithm of |Harshman and 
Lundy (1984); among numerous others. We note here that minimizing ([^ subject to the 
constraint X*^®^ = GW^*^ for all i G (1,2,... ,m} is equivalent to performing constrained 
Raw Stress MDS on the dissimilarity matrix 


( 2 ) 


Ai 

NA ■■ 

■ NA 

NA 

A 2 ■ ■ 

■ NA 

NA 

NA ■■ 

■ A^ 
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with configuration matrix 


= [(x(^))^|(x( 2))T| ... e 


i)mnxd 


subject to X*^*) = for all i G [m]. The “NA” entries in A represent the reality 

that the dissimilarities across modalities are unknown a priori. This is accounted for in 
the objective function by zeroing out the contribution to the stress associated with these 
entries of A. the weight matrix W is structured to zero out the missing data entries of 
r] in the objective function cr(-).. The constrained MDS iterative majorization algorithm 


of De Leeuw and Heiser (1980) can then be applied to approximately solve the 3-RSMDS 
model. As the JOFC procedure (see Algorithm and the accelerated fJOFC procedure 
(see Algorithm are both iterative majorization MDS procedures, we will provide the 
details of De Leeuw and Heiser (1980) applied to 3-RSMDS for the sake of comparison. 


The procedure of De Leeuw and Heiser (1980) consists of the following two iterated steps, 
given an initialization of the configuration X(o): 

1. At configuration X(t_i), ignoring the constraint that X*^*) = for all i G [m], 

compute the unconstrained update X(t) via the Guttman transform; see 


Borg and 


Groenen (2005). 


2. Set X(p = 


(xs,',Ti(xgyi---i(x<”V 


to be the minimizer of 


trace(X-X(p)^L(X-X(p), 


over X subject to the constraints X*^b = GW*-*^ for all i G {1,2, ...,m}. Here, 
L G is the block diagonal matrix with nin — Jn & in each of the m 

diagonal blocks, where Jn = ^ and is the column vector of all one’s 

in This minimization is often approached by alternating minimizing over G for 
a fixed W and then minimizing over W for a fixed G. 


1.2.1 The JOFC framework 

In the above 3-RSMDS framework, the matchedness of the n observations across the m 
dissimilarities is enforced via the X*^*) = GW^*^ constraints. In the JOFG algorithm, the 
matchedness constraint is built into the objective function as follows. Gontrasting the 
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raw stress criterion in Q, the variant of JOFC we consider seeks to produce an uncon¬ 
strained configuration G (where (X*^*^)"'' = 

(xf^)'''|(X 2 *^)'''| ■ ■ ■ l(Xm^)''' G are the points associated with Aj) that minimizes 

the raw stress criterion 




(*) vij)'\2 


(3) 


2=1 


l<2<j'<m £=1 


fidelity 


commensurability 


where d{-, •) is the Euclidean distance function. The raw stress criterion in JOFC is com¬ 
posed of three major pieces: 


1. The “hdelity” term, YlT=i '^i<j<i<n ~ which measures the faith¬ 
fulness of the embedding to the original dissimilarities, Note the the hdelity 

is equal to the raw stress criterion in 3-RSMDS ([^. 


2. The “commensurability” term, which measures how 

the geometry of the embeddings differs across modality. Similar to the role of the 
Xh) = constraints in 3-RSMDS, in JOFC the commensurability term (softly) 

enforces the matchedness of the n data points across the m modalities. We also 
note that the commensurability is proportional to the objective function of three-way 
Procrustes analysis 

m 

commensurability = E trace(XW - X^^'^ (X® - 

i<j 

m 

= m^trace(X(*)-X)^(X«-X), (4) 

2=1 

where X = m~^ YlT=i 


3. The weighting of the hdelity versus the commensurability of the embedding provided 
by w. If tc ^ 1, then the optimal embedding will preserve the within-modality dis¬ 
similarities at the expense of the cross-modality correspondence; i.e. each Aj will be 
ht separately. If tc S> 1, then the optimal embedding will preserve the cross-modality 
correspondence at the expense of the within-modality dissimilarities; i.e. from Eq (|^ 
we see that w 3> 1 would force all of the X*^®^ to be equal without concern for pre¬ 
serving the original AJs. In light of this, JOFC can be viewed as weakly constrained 
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Raw Stress MDS (see Borg and Groenen (2005) for detail), with w allowing us to 
continuously range between setting all to be equal but otherwise unconstrained 
{w = oo) at one extreme versus embedding the Aj’s completely separately {w = 0) 
at the other. 


The problem of choosing an optimal w was taken up in Adah and Priebe (2015). 


When the individual dissimilarities are normalized to have have ||Aj||^ = 1 for all 


i G [m], the results of Adah and Priebe (2015) suggest that, under suitable model 
assumptions, the performance of the JOFC procedure is relatively robust to the choice 
of w. In application, a data-adaptive w could be chosen via the bootstrapping AUC- 


optimization testing procedure of Adah and Priebe (2015), although we do not pursue 
this further here. 

As in 3-RSMDS, minimizing ([^ can be seen as unconstrained Raw Stress MDS on the 
omnibus dissimilarity matrix 



Ai 

V ■ 

■ h 



0 

NA ■■ 

■ NA 

<1 

V 

A 2 ■ 

■ V 


r] = 

NA 

0 ■■ 

■ NA 


V 

V ■■ 




NA 

NA ■■ 

■ 0 


(5) 


and conhguration 

= [(X^iy |(x(2))T|... |(x("*y] e 

with the associated weight matrix given by 

Jn In "^In 


^imnxd 


w = [w,,] = 


win 


Jn 


win win 


Jn In 


nmnxmn, 


( 6 ) 


indeed, this is immediate as the raw stress criterion in (j^ is equal to cr(X) = 

(ijj(X))^. Note that, as before, the weight matrix W is structured to zero out the missing 
data entries of tj in the objective function cr(-). 
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Note the different structure of A in JOFC versus A in 3-RSMDS. In JOFC, we impute 
the missing across modality dissimilarity between the same object to be 0, which allows us 
to build the matchedness constraint into the raw stress criterion (via the commensurability 
term). In both models, we treat inter-object, cross-modality dissimilarites as missing data, 
and this represents the assumption that these dissimilarites are often not available in the 
embedding procedure. 


Remark 2. In Priebe et ah (2013), the missing cross-modality dissimilarity between modal¬ 
ity i and modality j was imputed as (A* -|- Aj)/2, and A was embedded using classical 
multidimensional scaling. Here we choose not to impute the missing data for two main 
reasons: imputing the cross-modality dissimilarities potentially increases the variance in 
our embedded points; and the special structure of W in the missing data setting allows us 
to greatly speed up and parallelize the JOFC procedure (see Section]^. In addition, in 
many real data settings (see Section the n objects originate from disparate data sources 
and are not simply repeated measurements of the same objects in a single space, which 
further complicates the very concept of cross-modality dissimilarities. 


Similar to the approach in De Leeuw and Heiser (1980) for 3-RSMDS, our JOFC ap¬ 
proach embeds A by minimizing ([^ via successive Guttman transforms. As in the ma- 
jorization algorithm for solving 3-RSMDS, the Guttman transform step of JOFC can be 
efficiently computed (see Algorithm]^. However, in JOFC the matchedness constraint is 
built into the raw stress criterion, and we are therefore able to avoid the potentially costly 


Step 2 of the 3-RSMDS procedure as outlined in Section 0^2} The JOFC algorithm proceeds 
as follows: 


1. Initialize the configuration X(o). One easily implemented initialization imputes the 
missing data entries of A as in Remark and performs classical MDS on A; see Step 
1 of Algorithm [T] for detail. 

2. For a given threshold e > 0, while cr(X(p) — cT(X(f_i)) > e, iteratively update Xi_i via 

the Guttman transform. To wit, let L be the combinatorial Laplacian of the weight 
matrix W (i.e., if D is the diagonal matrix with Wij, then L = D — W), 
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and define 


if i 7 ^ j and dij (X) ^ 0 
if i 7 ^ j and dij (X) = 0 


: = 


di,A^) 

0 


^-ELw.i^(xk, if^ = J. 

Then the raw stress criterion ([^ can be written 


^(^w) = + traceXj)LX(t) - 2traceXj)E(X(i))X(t), 

i<j 

which is majorized by 


(7) 


+ traceX|'J)LX(t) — 2traceX(J)i?(X(t_i))X(i_i), (8) 

i<j 

a qnadratic fnnction of X(t). The minimizer of (|^ can be fonnd by solving the 
stationary eqnation V(T(X(q) = 2LX(q — 2i?(X(t_i))X(f_i) = 0. The Gnttman 
transform npdates a conhgnration by solving LXq) = i?(X(t_i))Xq_i); in 

the mnltidimensional scaling literatnre, this transformation is often written as X( 4 ) = 
r(X(t_i)) = Lt5(X(t_i))X(*_i) where Li is the Moore-Penrose psendoinverse of L. 
Notice that X(t) is centered at zero even if is not centered at zero. 


For JOFC, the resnlting iterative algorithm is snmmarized in Algorithm Note that the 
seqnence of steps generated by snccessive Gnttman transforms is derived via majorization, 
and we note that Algorithm [T] is closely related to the popnlar SMAGOF algorithm for 


metric mnltidimensional scaling; see De Leenw and Reiser (1980); de Leenw (1988). 


Remark 3. In all of the experiments in Section the threshold e is set to 10“®(’^k)j 
i.e., we terminate the procednre when the normalized stress o' 7 v(-) := 
decrease by at least 10“® between snccessive iterations. Note however that, in practice, 
the seqnential Gnttman transforms often exhibit good global properties, and only a few 
iterations are reqnired to obtain a snfficiently good snboptimal embedding, see [Kearsley 
. We empirically observe this phenomena in Fignre where we see that the 
conhgnration obtained by fJOFG can stabilize after only relatively few iterates. 


et al. (1995 
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Algorithm 1 JOFC Algorithm for Manifold Matching (see Section 1.2.1 for detail) 


Require: Omnibus dissimilarity matrix A, weight matrix W, embedding dimension d, 
tol= e 


Ensure: X G 


l)7nnxd 


, a conhguration of points in 


1: Initialize X(o) via cMDS (classical MDS , see Torgerson (1952); Borg and Groenen 


^2. 


(2005) for detail) of A 

i. Set A*^^^ to be the element-wise square of A; i.e., A^^ = (Ajj 

ii. Compute P mn ^ ^ Jmn ) i 

iii. Compute the d largest eigenvalues Ai, A 2 , • • • , A^ of P with corresponding eigenvec¬ 


tors Ui,U 2 , ...,Ud, 

iv. Set X(o) = [ui\u 2 \■ ■ ■ |Mrf]diag(Ai)^/2; 
2: Compute cr(X(o)) 

3: while a (X(p) - cr(Xp_i)) > e do 
4: X(q = LtR(X(i_i))X(,_i) 

5: Compute cr(X(q) 

6: end while 

7: Output the hnal iteration X(finai) 


In general, must be calculated by singular value or QR decomposition, which may 
be prohibitively expensive if mn is large, with computational complexity of order 0{m^n^). 
Fortunately, there are many applications in which the special structure of the weight ma¬ 
trix W allows for direct calculation of L'^, sometimes with subsequent simplihcation of 
LtR(Xp_ i))X(t_i). Examples include the familiar case of unit weights (which is the case for 


the Guttman transform needed in Step 1 of the 3-RSMDS algorithm in Section 1.2) and the 


case of symmetric block-circulant matrices, see Gower and Groenen (1990); Gower (2006). 
In Section we demonstrate that the special structure of JOFC also permits the direct 
calculation of L'^ which then results in a much simplihed calculation of L'^i?(X(f_i))X(t_i). 
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2 Fast JOFC 


In each iteration of the JOFC algorithm (Algorithm [^, we update the configuration via a 
Guttman transform X(p = Computationally, this involves 

1. A single calculation of L^, which naively has algorithmic complexity 0((mn)^) given 
an SVD (or QR decomposition) based pseudoinverse algorithm. Clearly, as iJ does 
not vary in t, we do not need to recalculate this pseudoinverse in every iteration. 

2. Computing Lli?(X(t_i))X(f_i), which has complexity 0{{mn)‘^d). 

Therefore, given a bounded number of iterations and assuming d < mn, the JOFC algo¬ 
rithm has algorithmic complexity 0{{mn)^). 

To speed up the JOFC procedure, we first note that the form of the JOFC weight 
matrix allows us to algebraically compute Lh Next, we show that the resulting form of 
the pseudoinverse allows us to greatly simplify the computation of Lli?(X(i_i))X(j_i). In 
addition, the computation of Lli?(X(t_i))X(t_i) easily lends itself to parallelization. 


2.1 Computing 


The hrst step in speeding up Algorithm is algebraically computing the pseudoinverse Lh 
Here, we present the computation of in the case of a more general weight matrix than 
considered in Eq. (§; namely, we will consider here W of the form 


W = = 


^1,1 ('-C In) 

W2,lln 

^m,l Jn 


'U^l,2-fn 

W2,2{Jn — In) 

U!m,2ln ' 


'^2,min 

Wm,m {Jn - In 


-\mnxmn, 


(9) 


with Wij = Wj^i for all G [m] such that i ^ j. This form of W allows for different 
weightings across and within modalities. The case of equal weights off diagonal, i.e., the 
W in Eq. (|^, will then be realized as a special case of this more general W. 
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In Appendix]^ we prove the following. Writing 


W = 


-^1,2 

-W2,l nW2,2 + ^2j 


-W2,m 


Wm,l 'Wm,2 ' ' ' ' ' ' A 

and diag(taj^j) := diag(tai^i, 1 ^ 2 , 2 , • • • we algebraically compute via 


( 10 ) 


Lt = W-^®In + 


-(w + n (— - diag(wi,i)^^ (— - diag(wi,i)^ yV 1 - — 
\ \mn J J \mn J mn 


®Jn 


While brute force computation of W“^ (and Z) would incur a 0{rn?) cost as opposed 
to the 0{m?n^) cost of a brute force computation of Ll, structured weight matrices can 
greatly simplify this computation. For example, if W is of the form of Eq. (|^, then a brief 
calculation yields that 


= 


n-\-w 

w 

w 

n{n-\-mw) 

w 

n{n-\-mw) 

n-\-w 

n(n+mw) 

w 

n{n-\-mw) 

n{n-\-mw) 

n(n+ 7 nw) 

w 

w 

n+w 

n{n-\-mw) 

71(71+77110) 

n(n+mw) 


( 11 ) 


and 


W + n (— - diag(u;i,i)^ ^ 
\mn J J 


-1 


^ - diag(m.,)') W-‘ - ^ 

ran / mn 


wn^m'^{n-\-wm) 

-m?w^—mnw—n" 

wn^rri^{n-\-wni) 


w'n? 7 n^{n-\-w'in) 


wn'^m?‘{n-\-wrn) 

—m^w^+rn'n?—rnnw—'n? 
wn'^mP' {n-\-wm) 


wn'^mP' {n-\-wm) 


wri^m? {n-\-wm) 

-m? —mnw—n^ 
wn'^mP' {n-\-wm) 


wn'^'m? {n-\-wm) 


We shall see in Section |2.2| how these algebraic computations greatly speed-up the compu¬ 
tation of the Guttman transform in the fJOFC procedure. 

Also note that in implementing the fJOFC algorithm, only needs to be computed. 
Indeed, l)(j„i?(Xq_i)) = = 0, which immediately implies that 

- 1 / 


W + n(— - diag(wi,i)^ ^ (— - diag(wi,i)^ W 

\mn J J \mn J 


1 Jm 


mn 


-B(X(i_i)) — 0^ 
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Resultingly, the Gutman transform in the t-th iteration of fJOFC is computed simply as 


Xp) = (W-' ® In) R(Xp_i))Xp_i). 


Remark 4. The key to computing the form of is realizing that can be written as 

t / 1 \ 1 

L I L T J m.n. 1 J m.n. • 


mn 


mn 


We then compute the exact form of (L + :;^Jmn) by inverting the structured matrix 

L T Jmn hV In T ( Jm diag('U7pi) j Jn' 
mn \mn J 

This inverse computation (Theorem 10 in Appendix can be generalized to the following 


Woodbury-type (Woodbury, 1950) matrix identity for the sum of Kronecker products. Let 
A,B E be matrices such that A and [A + nB) are invertible matrices. Then it 

follows that 


{A®In + B® Jn)~^ = 0 4 - (A + nB)-^BA-^ ® Jn- 


This formula generalizes Theorem 10, and we are presently exploring different use cases for 
such an identity. 


Remark 5. Even given identical initializations, the fJOFC algorithm (Algorithm 2), 
and the JOFC algorithm may not give identical embeddings of A, as JOFC relies on a 
computational approximation of L'^, while fJOFC exactly algebraically computes Lb 


2.1.1 More general weight matrices 


We described above how the structured W of Eq. (§ offers an easily computed form 
for >V“^, and here we will briefly outline some other potentially useful structured weight 
matrices that lend themselves to easily compute W~^. If W is of the form 


w = [W,j] 


^l,l('Jn In) 
'Wl,lW2,2ln 


^1,1^2,24 
'W2,2{Jn — In) 

^2,2^m,m4 ' 


W2,2'^m,niln 


e (M+) 


mnxmn. 

1 


( 12 ) 


w. 




(Jn 
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so that each modality has its own (potentially unique) weight, and the cross modality 
dissimilarities are weighted via a product of the within modality weights, then 




diag(wj,^) ^ _ 1 j 


so that the k, £-th entry of W ^ is equal to 


= 


_1_+_1_ 

1 


lik = i 
else. 


Increasing the weight of the within-modality embeddings can easily be achieved by letting 
W be set to 


W = cn ■ diag(tCi,j) + diag(tCi, 



Im - Jmdiag(wi,i 


in which case 


yy-l _ diag(w^,i) ^ ^ 1 J 

+ Si cn{cn + X]• Wi^i) 

Increasing (resp., decreasing) the value of the constant c will have the effect of emphasizing 
(resp., deemphasizing) the hdelity of the subsequent embedding. 


2.2 Effect on the computation of 

Exploiting the form of iJ computed above, we use the special structure of i?(X(t_i)) to 
simplify and speed up the calculation of the Guttman transform needed in the t-th iteration 
of the JOFC algorithm. 

We hrst note that is block diagonal, with m diagonal blocks each of size nxn. 

We will denote the diagonal blocks of i?(Xq_i)) by Bi, B 2 ,..., B^- By construction, 

= -B(X(t_l))lmn = 0, 


and therefore In^j = Bjln = 0 for all j = 1, 2,..., m. It follows that BjJn = JnBj = 0 for 
all j = 1, 2,..., m. Dehning 


M n + w ^ 

W := ^- -Ini and C := 


w 


n[n 


mw] 


n{n + mw) 


kni 
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we arrive at 


and so 



A' 

C' ■■ 

■ C" 


Bi 

0 ■■ 

■ 0 


C' 

A' ■■ 

■ C" 


0 

B 2 ■■ 

■ 0 


C 

C" ■■ 

■ A' 


0 

0 ■■ 

■ -Bm 










X(,) = Lt5(X(i_i))X(t_i) = 


A' C ■■■ C 

a A' ■■■ c" 

C' C ■■■ A' 


n 


n{n + mw) 


^nm A 


W 


n{n + mw) 


V 


drj, In. ' ' ' In. 


In In 


In In 


From (13), it is immediate that the update is realized via 


Xd) = ^ 

n{n + mw) 




w 


Bi 0 ■■■ 0 

0 B 2 ■■■ 0 

0 0 ■■■ Brn. 



v(l) 


v(2) 


Y(m) 




R "vd) 



R 'V"d) 

j 


R yI™) 


BfX 


(d 


^ n{n + mw) 


(13) 


(14) 


Remark 6. Note that to efficiently compute (|14|), we can first compute each in 


yf-i) 


parallel for i E [m], and then compute the update in Eq. (14). 


2.3 The fJOFC algorithm 


The algebraic computation of in Section 2.1 combined with the computation of the 


Guttman transform of Section 2.2 combine to give us the fJOFC algorithm, which is detailed 
below and in Algorithm 

The fJOFC algorithm proceeds as follows: 

1. Initialize the configuration X(o). If the initialization of the JOFC procedure in Remark 
1^ is too computationally intensive (in particular, the initialization uses cMDS to 
embed the mn x mn omnibus dissimilarity with off-diagonal blocks imputed to be 
(Aj -|- Aj)/2) we could proceed as follows: first, use cMDS to embed the average 
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Algorithm 2 fJOFC: Fast JOFC Algorithm for Manifold Matching 

Require: Omnibns dissimilarity matrix A, weight matrix W, embedding dimension d, 

tol= e 


Ensure: X G a conhgnration of points in 

1: Set .^0 to be the conhgnration obtained via cMDS of (so® Step 1. of Algo¬ 

rithm 0 for detail); Center via 
2: for i=l,2,... ,m do 

3: Set to be the conhgnration obtained via cMDS of Ap ; Center via = 

yAi)j 


4: 


Set to be the orthogonal Procrustes ht of onto 

i. Set T = 

ii. Let U'EV'^ be the singular value decomposition of T; 
hi. Set Xg = ^iUV^ 


5: 

end for 



6: 

Set X J) = 

)xS) 

T|(xsg)-t|...|(x;™v' 

7: 

Compute a 

'X(o)) as in Remark]^ 

8: 

while (t(X(p) — a 

(Xp_i)) > e do 

9: 

for j 

=1,2,.. 

. ,m do 

10: 


Compute 

11: 

end 

for 


12: 

for j 

=1,2,.. 

. ,m do 

13: 


Set XW - B(X«„) 

14: 

end 

for 


15: 

Set XJ) = 

tx!;Ai(xSri---i(x 

16: 

Compute a 

'X(i)) as in Remark]^ 

17: 

end while 




-U) 


(m)'iT 

(b ^ 


n{n-\-nw) 




18: Output the hnal iteration X(finai) 


dissimilarity matrix Aj) /m, obtaining the conhgnration ^o; use cMDS to embed 
each Aj and set x|q^ to be the orthogonal Procrustes ht of the embedding to .^o—see 
Step 4 of Algorithm for detail. 
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2. Given current configuration and error threshold e, while cT(X(t_i)) > e, 


compute the Guttman transform of X(j_i) to obtain X(t) as outlined in Section 2.2 
(lines 9-15 of Algorithm 2). To wit, hrst compute each The update 

is then realized by setting 


v(i) _ 
^(i) - 


n 


n{n + nw) 




w 


B(X™.,)X 


e=i 


n{n + nw) 


(b 


h-l) 


for all j G [m]. Each of these m updates has computational complexity 0{mn^d). 


Remark 7. Further speeding up the fJOFC procedure, from Eq. (|^, we see that to com¬ 
pute ct(X), we need not compute all (™"') pairwise distance between rows of X. Indeed, we 
only need to compute ni{^ + {^)n interpoint distances. Indeed, the hdelity can be written 
as 

m m 

i=l i=l 

and the commensurability requires (™) paired distance calculations amongst the n points 
across the m modalities. 


Given a bounded number of Guttman transform updates, the fJOFC algorithm has 
complexity 0{ni?n^d). Contrasting this with the 0{{mn)^) complexity of JOFC points to 
the dramatic speedup achieved by fJOFC; see Section|^for further empirical demonstrations 
of this computational savings. We also recall that, even with identical initializations, the 
JOFC iterates and fJOFC iterates will not agree in general. The JOFC iterates rely on 
an approximate computation of Ll while the fJOFC iterates utilize an exact algebraically 
computed Lb Hence, the fJOFC iterates are not only more efficiently computed than the 
corresponding JOFC iterates, they are also less noisy. 

Remark 8. Each step of the fJOFC procedure easily lends itself to parallel computation. 
Implemented in parallel, given a bounded number of Guttman transform updates, fJOFC 
has complexity 0{ni?n^d/c) when run in parallel over c cores. 


3 Fast out-of-sample embedding for JOFC 


The out-of-sample embedding framework was developed for classical MDS in Trosset and 


Priebe (2008) and for Raw Stress MDS in Ma (2010). Extending the latter, we develop 
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the out-of-sample embedding framework for JOFC. We then demonstrate how this out-of- 
sample embedding can be dramatically sped-up by exploiting the special structure of the 
associated JOFC weight matrix, akin to the speedup of fJOFC over JOFC, and empirically 
demonstrate the efficiency of the procedure in Section [4~4 

Given a configuration X G obtained via JOFC (or fJOFC) applied to A G 

j^mnxmn^ we observe a new object O, giving rise to the out-of-sample omnibus dissimilarity 



7] 

V 


V 

7 ] 


a(o) 


^ j^m(n+l)xm(n+l). ^(°) _ 


Aj 5i 

5J 0 




where, for each i G [m], represents the within modality dissimilarities between O and 
the in sample-data objects for the i-th modality. 

While we could run JOFC (or fJOFC) on the full if m or n is large this often 
becomes computationally burdensome. Rather, without re-embedding A, we seek to embed 
O into the configuration space determined by X so as to best preserve both the matchedness 
across the m versions of O and the within modality dissimilarities provided by In 

the JOFC Raw Stress framework, the out-of-sample raw stress criterion is given by 


m 

(T^{y) = ^Y^{5i{j)-d{y^f,yj)f+w '^d{yi,yjf , (15) 

i=l j i<j 


out-of-sample fidelity out-of-sample commensurability 


where = [y^ |yj | ■ ■ ■ ly^] G is the configuration obtained for the new out-of-sample 
observation O. 


19 











Reordering the rows and columns of A*-®) slightly, 


= 






NA ■■ 

■ NA 


A 


NA 

<52 ■■ 

■ NA 




NA 

NA ■■ 

■ Sm 

^7 

NA ■■■ 

NA 

0 

0 ■ 

■■ 0 

NA 

■■■ 

NA 

0 

0 ■ 

■■ 0 

NA 

NA ■■■ 


0 

0 ■ 

■■ 0 


we see that the raw stress criterion (15) can be written as 


'Jx(y) = 5^ W<“>(A<J - 


i<j 


with the weight matrix and conhguration given by (where for h, /c G Z > 0, Oh,k 
is the h X k matrix of all O’s ) 


Omn,mn 

® In 

x(°) 

Im ® 1 ,^ wJjYi^rn 


Decompose the Laplacian of via 

mn cols 

m cols 

, , mn rows / 

= 

Li,i 

-hi,2 

m rows \ 

. ^72 

-^2,2 


X 

yi 

y2 


and dehne as in Eq. ([^, a similar decomposition of B is given by 




mn rows 


m rows 


mn cols m cols 
Bi2 

^ 1,2 -^ 2,2 


^1.1 

T 


Bu 


5 ( 2 diagll-' ((5jO 


rf(XW,y(t_i)) 
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where “o” is the Hadamard product, and for each j G [m], 


d(X0),yp_i)) (yp_i))i)’' ■ ■ ’ 


Note that 


-Bi,2 — 




Om 


0. 


6-2 O 


0. 


d(X(2),y(,_i)) 


0. 


0. 

0. 




^ ^ d(X(-),y(t_i)), 


A similar majorization argument to that of in-sample JOFC yields the out-of-sample em¬ 
bedding procedure: 


1. Initialize the out-of-sample conhguration at a random initialization y = y(o). 

2. While cTx(y(i)) — crx(yp_i)) > e for a predetermined threshold e, update y^t) via the 
Guttman transform: 

y(t) = 4 , 2 (^ 1^2 - 42 )^ + 4,2 ■ diag (^ 1 '^ o ^ (16) 

Derivation of this update via majorization is completely analogous to the derivation 
of the JOFC update step, and so details are suppressed. 


As L 2,2 = (n + mw)Im — wJm-, it is immediate that = 
Therefore, to efficiently compute (16), we: 


1 


~Jm + 


n-\-mw ^ n{n-\-mw) 


J m. • 


1. For each j G [m], compute 


ei:= 



1 

d(X0'),yp_i)) 





and 

For each j G [m], this vector-matrix multiplication has complexity 0{nd), and the 
full complexity of this step is 0{nmd). 
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2. Routine computations then yield the following simplihcation of the Guttman trans¬ 
form update: 

0 


(yh))j 


+ 


w 


n + mw n{n -|- mw) 


5^6 


k=l 


+ 






n + mw 


{y(t-i))j + 


w 


n{n + mw) 


■ (y(t-i))fc 


k=l 


each of which has complexity 0{d), and the full complexity of this step is 0{md) 


Given a hxed number of modalities m and a bounded number of iterates in the algorithm, 
the complexity of embedding each new out-of-sample observation is linear in n, allowing for 
this out-of-sample procedure to be efficiently implemented on very large data sets. We note 
that the details for simultaneously embedding k > 1 out-of-sample points are completely 
analogous to the k = 1 case and so are omitted. 


4 Results 

In this section we both compare and contrast fJOFG and 3-RSMDS and demonstrate the 
dramatic run time increase achievable by fJOFG versus JOFG over a variety of real and 
simulated data examples; note that all run times are measured in seconds. In all examples, 
the algorithms were implemented on a MacBook Pro with a 2.6 GHz Intel Gore i5 processor 
and 4GB 1600 MHz DDRS memory. 

4.1 3-RSMDS and fJOFC 

As mentioned previously, fJOFG can be viewed as a softly constrained version of 3-RSMDS. 
Herein, through a simple illustrative experiment, we highlight the advantages (and disad¬ 
vantages) of the fJOFG framework. Let Y G rows which are independent 2- 

dimensional Gaussian((5,5), 12 ) random variables. Letting = max(y) — min(R), for i = 
1, 2, 3, we set to be R +Ei, with the entries of Ei being independent Uniform(—;2/50, z/bQ) 
random variables, which are also independent across i. We set Aj to be the interpoint dis¬ 
tance matrix of 1^. These {1^} represent our n = 400 objects measured under m = 3 
modalities. Let Z G have rows 11,12,... ,400 identical to those in Y and let the 
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Figure 1: For a single Monte Carlo iterate, we plot the embeddings of the three dissimilar¬ 
ities in the matched (top row, left panel is fJOFC, right panel is 3-MDS) and the anomaly 
(bottom row, left panel is fJOFC, right panel is 3-MDS) settings. In the matched setting, 
matched triplets are connected by blue lines. In the anomaly setting, blue lines connect 
the matched points across the embeddings and red lines connect the ten anomaly points 
they are “matched” to in the data. 


first ten rows of Z be independent 2-dimensional Gaussian(( 8 , 8 ), 2 ■ I 2 ) random variables. 
Let Y 4 = Z + E 4 ^, with F ^4 defined analogously to the E[s above. Let A 4 be the interpoint 
distance matrix of Y^. 

We use fJOFC and the INDSCAL algorithm (for 3-RSMDS, as implemented in the 


smacof package (Leeuw and Mair, 2008) in R) to embed (Ai, A2, A3) (the matched setting) 
and (Ai,A 2 ,A 4 ) (the anomaly setting). Results are summarized below in Figure and 
Table 1^ In Figure [TJ for a single Monte Carlo iterate, we plot the embeddings of the three 
dissimilarities in the matched (top row, left panel is fJOFC, right panel is 3-MDS) and the 
anomaly (bottom row, left panel is fJOFC, right panel is 3-MDS) settings. In the matched 
setting, matched triplets are connected by blue lines. In the anomaly setting, blue lines 
connect the matched points across the embeddings and red lines connect the ten anomaly 
points they are “matched” to in the data. 
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Method 

Runtimel 

Stressl 

ARI 

Runtime2 

Stress2 

ARK 

Conf. Ratio 

3-RSMDS 

214.52 

0.042 

0.69 

312.39 

0.18 

0.40 

10.29 

fJOFC 

1.39 

0.03 

0.66 

11.56 

0.16 

0.57 

76.07 


Table 2: The average running time (over 25 MC iterates) is shown as Runtimel (in the 
matched setting) and Runtime2 (in the anomaly setting). The average hnal normalized 
stress is Stressl (in the matched setting) and Stress2 (in the anomaly setting). In the 
matched setting, the ARI gives the a measure of the hdelity of the R'-means clustering of 
the data into 400 clusters (each should contain the three jitters of the same point). In the 
anomaly setting, the ARK column gives a measure of the hdelity of the A'-means clustering 
of the non-anomalous data into 390 clusters (each should contain the three jitters of the 
same point). Lastly, the Conf. Ratio column gives the ratio of the average distance between 
the triplets of points that have the anomalies (the ten outlier triplets) and the triplets that 
are correctly matched in the anomaly setting (the 390 non-outlier triplets). 


Results over 25 MC iterates are summarized in Table The average running time 
is shown as Runtimel (in the matched setting) and Runtime2 (in the anomaly setting). 
The average hnal normalized stress is Stressl (in the matched setting) and Stress2 (in 


the anomaly setting). In the matched setting, the ARI (adjusted Rand index; see Hubert 


and Arabie ( |1985 )) a measure of the hdelity of the iC-means clustering of the data into 


400 clusters (each should contain the three jitters of the same point). In the anomaly 
setting, the column ARK gives the a measure of the hdelity of the iC-means clustering 
of the non-anomalous data into 390 clusters (each should contain the three jitters of the 
same non-anomalous point). An ARI of 1 means that the clustering of the embedded 
points perfectly clusters the repeated observations of the data, while an ARI of 0 indicates 
that the clustering of the embedded points behaves as chance in recovering the clusters of 
the repeated observations. Lastly, the Conf. Ratio column gives the ratio of the average 
distance between the triplets of points that have the anomalies (the ten anomaly triplets) 
and the triplets that are correctly matched in the anomaly setting (the 390 non-anomaly 
triplets). From this simple experiment, we see that fJOFC is empirically i. much faster 
than (this off the shelf implementation of) 3-RSMDS; ii. performs comparably to 3-RSMDS 
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(a) Matched Setting. 


(b) Anomaly Setting. 


II II 

Figure 2: We plot the average (over 25 Monte Carlo iterates) relative error ^ 


(final) \\2 


±2 s.e. over a range of values of k (the x-axis in Figure 2(a)). In the left panel, we plot 
the ratio in the matched setting with m = 4, 5, 6 dissimilarities and in the right panel, 
we plot the ratio in the anomaly setting with m = 4,5,6 dissimilarities, one of which 
contains the anomaly. Note that in the anomaly setting, only the relative error amongst 
the n * (m — 1) + (n — 10) non-anomalous points is plotted. 


when the data are all matched across the modalities with no anomalous behavior—see the 
ARI column in Table hi. is better able to preserve the correct matchedness in the 
presence of anomalous data—see the ARI2 and Conf. Ratio columns of Table results 
which are echoed in 


Sun and Priebe (2013); Shen et ah (2016 


4.2 Error tolerance 


With the same setting as in Section |4.1 we explore the effect of early stopping on the 
global fJOFC output. As mentioned previously, the sequential Guttman transforms often 
exhibit good global properties, and good solutions can often be obtained after only a few 
iterates. To demonstrate this, we plot the relative error (over 25 Monte Carlo iterates) 

l|X(final)-X(fc)||2 


^(final) l|2 


±2 s.e. over a range of values of k (the x-axis in Figure 


2(a)). In the left 


panel, we plot the ratio in the matched setting with m = 4, 5, 6 dissimilarities and in the 
right panel, we plot the ratio in the anomaly setting with m = 4, 5,6 dissimilarities, one 
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Fixed n 


Fixed m 




algorithm 

—fast JOFC 
JOFC 


Figure 3: We embed A G via fJOFC and JOFC using identical initial configura¬ 

tions X(o) =cMDS(A) as in Remark]^ We then plot the average run time (in seconds) 
per iteration (±2s.e.) versus m (left panel) and n (right panel) for both JOFC and fJOFC, 
averaged over 50 Monte Carlo replicates. In the left panel we fix n = 400, and vary 
m = 2, 3,4,5, 6. In the right panel, we fix m = 3 and vary n = 200,400,600,800,1000. 


of which contains the anomaly. Note that in the anomaly setting, only the relative error 
amongst the n * (m — 1) J- (n — 10) non-anomalous points is plotted. We see that, in the 
matched setting, very few sequential iterates are needed before the embedding stabilizes. 
In the anomaly setting—when on average over 100 sequential iterates are needed for the 
algorithm to terminate with e = 10“® tolerance—we see ~ 5% relative error with only 
25 iterates. Indeed, here and in the real data examples, we find that e = 10“® is often 
a conservative tolerance level and a sufficiently good embedding can be obtained with far 
fewer iterates; we are presently investigating methods for adaptively choosing the number 
of iterates, though we do not pursue this further here. 


4.3 JOFC versus fJOFC 

Let Y G ]R400x 2 rows which are independent 2-dimensional Gaussian((5, 5), J 2 ) random 
variables. Letting ^ = max(y) — min(R), for z = 1, 2,..., 6, we set Yi to be R -|- Ei, with 
the entries of Ei being independent Uniform(—x;/50, z/bQ) random variables, which are also 
independent across i. We set Aj to be the interpoint distance matrix of F). These {Yi\ 
represent our n = 400 objects measured under m = 6 modalities. For m = 2, 3,..., 6, we 
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embed the omnibus matrix A (defined as in Section 1.2.1) into with both fJOFC (in 
serial) and JOFC using an identical initial configurations X(o) = cMDS(A), as outlined 
in Remark We plot the average run time per iteration versus m for both fJOFC and 
JOFC in Figure (left panel), averaged over 50 Monte Carlo replicates. Even in this 
relatively small simulation, the decreased runtime speed is dramatically illustrated, even 
with fJOFC run in serial. The ratio of the average run times (JOFC versus fJOFC) is 
(2.86,4.82,6.70, 8.59,10.71) for m = (2, 3,4,5, 6), which suggests that fJOFC is a factor of 
m (~ 1.6m) faster than JOFC here. This corroborates the runtime results in Section 
indeed, as here n is constant, JOFC has complexity 0{m?) while fJOFC has complexity 

We next consider the case of fixed m = 3 and varying n = (200,400,600,800,1000). 
With Y and A defined as above, we again embed A G into via fJOFC (in serial) 

and JOFC using identical initial configurations X(o) =cMDS(A). In Figure]^ (right panel), 
we plot the average run time per iteration versus n for both JOFC and fJOFC, averaged 
over 50 Monte Carlo replicates. Again, note the dramatic speedup achieved by fJOFC, with 
the ratio of the average run times (JOFC versus fJOFC) being (2.10,4.86, 7.45,10.13,12.63) 
for n = (200,400, 600, 800,1000). This suggests that fJOFC is a factor of n («i 0.12n) faster 
than JOFC here, which corroborates the runtime results in Section]^ indeed, as here m is 
constant, JOFC has complexity O(n^) while fJOFC has complexity O(n^). 


4.4 Out-of-sample efficiency 

We next demonstrate the efficiency of the out-of-sample fJOFC procedure. With the same 
data set-up as above (with 3-dimensional Gaussian random variables here, but otherwise 
identical to the data setup used above), we embed all but one object of A G via 

fJOFC, and use the out-of-sample procedure to embed the final object (m views of the 
n-th object). Running time results (in seconds) are plotted in Figure]^ where we plot the 
average running time (in seconds) ±2s.e. of the in-sample and the out-of-sample procedure 
versus m (left panel) and versus n (right panel), averaged over 25 Monte Carlo replicates. 
In the left panel we fix n = 200, and vary m = 10,15, 20, 25, 30. In the right panel, we 
fix m = 10 and vary n = 200, 300,400, 500, 600. As seen previously, the runtime of fJOFC 
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Fixed n 


Fixed m 



Figure 4: We embed all but one object of A G via fJOFC, and use the out-of- 

sample procedure to embed the hnal object (m views of the n-th object). We then plot the 
average running time (in seconds) ±2s.e. of the in-sample and the out-of-sample procedure 
versus m (left panel) and versus n (right panel), averaged over 25 Monte Carlo replicates. 
In the left panel we £x n = 200, and vary m = 10,15, 20, 25, 30. In the right panel, we fix 
m = 10 and vary n = 200, 300,400, 500, 600. 

empirically varies quadratically (in n for fixed m and in m for fixed n). However, we observe 
that the runtime of the out-of-sample procedure empirically varies linearly (in n for fixed 
m and in m for hxed n), which agrees with the computational complexity results of Section 

El 

In Table we show the sum of the residual errors of the out-of-sample embedding versus 
the in-sample embedding— YlT=i ll^(finai)[^’ •] ~ —for hxed n and varying m (top row) 
and for hxed m and varying n (bottom row) averaged over 25 Monte Carlo iterates. For 
each combination of m and n, we hrst embed the full mn x mn dissimilarity A using 
fJOFC. We next embed all but one of the objects (n — 1 objects over m modalities) using 
fJOFC and the n-th object via the out-of-sample procedure of Section and compute the 
sum of the residual errors between the out-of-sample and the in-sample embeddings of the 
n-th object. We see that, for hxed n and varying m, the total error is increasing in m but 
negligible on average per modality. As m is hxed and n varies, the total error is relatively 
constant, which is unsurprising as, in each case, exactly m additional data points are being 
out-of-sample embedded into a hxed dimensional space. 







Sum of Residual Errors of Out-of-Sample Versus In-Sample 



m = 10 

m = 15 

m = 20 

m = 25 

m = 30 

n = 200 

0.067 

0.121 

0.184 

0.366 

0.364 


n = 200 

n = 300 

n = 400 

n = 500 

n = 600 

m = 10 

0.057 

0.059 

0.101 

0.078 

0.091 


Table 3: The sum of the residual errors of the out-of-sample embedding versus the in- 
sample embedding—ll^(finai)[^’ 0 “Yilb—for fixed n and varying m (top row) and for 
hxed m and varying n (bottom row) averaged over 25 Monte Carlo iterates. 


4.5 Real Data Examples 

We next demonstrate the key feature of the JOFC procedure in a pair of real data sets; 
namely, the ability of the algorithm to preserve cross-modality matchedness while not 
forcing incommensurate versions of the data points to be artihcially embedded close to one 
another. Indeed, in the JOFC procedure, 

1. if an object’s properties are well-preserved across the m modalities, then the object’s 
associated m points in the conhguration will be embedded close to each other; 

2. if an object’s properties are not well-preserved across the m modalities, then JOFC 
(with well-chosen w) will not artihcially force the object’s m incommensurate conhg¬ 
uration points to be close to each other in the embedding. 


Incommensurate embeddings can inform both how and why the data modalities diher. By 
studying these pathologies further, we aim to better understand the data features that are 
emphasized in one modality versus another, which is crucial for understanding potential 
benehts from pursuing further inference in the joint (versus single) embedding space. 

We explore this further below in a data set derived from the French and English 
Wikipedia graphs and in a time series of zebrahsh calcium ion brain images from [Prevedel 


et al. (2014). 
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(a) Dendrogram merge heights, all n = 1382 points, (b) Dendrogram merge heights for the n = 1055 points 

with merge height < 100. 


Figure 5: Histograms showing, for each of the n = 1382 points in (a) and for each of 
the 1055 points with merge height < 100 in (b), the height in the hierarchical clustering 
dendrogram when each of the four modalities was hrst merged into a single cluster for that 
point. 


4.5.1 Wikipedia 

We collect the n =1382 articles from English Wikipedia which compose the 2-hop 

neighborhood of the article entitled “Algebraic Geometry” (where articles are linked if there 
exists a hyperlink in one article to the other, and these links are considered undirected). 
There is a natural 1-1 correspondence between these articles and their versions in French 
Wikipedia, and we will denote the associated French articles by {y 2 i}\Ti- 


As in Shen et ah (2016), each for j = 1,2, further gives rise to two measures 

of inter-article dissimilarity: Aji, the shortest path distance in the undirected hyperlink 
graph; and Aj 2 , the cosine dissimilarities between text feature vectors (provided by latent 


semantic indexing, see Deerwester et ah (1990) for detail) associated with each article. 


We use fJOFC—with tc = 10 as suggested by Adah and Priebe (2015)—to embed these 
n = 1382 points across m = 4 modalities into Note that implementing our fJOFC 
algorithm in serial ran in :^42.2 minutes while the JOFC algorithm with the same settings 
ran in :^10.37 hours (a factor of :^14.7 speedup). 

In this omnibus embedding, if all 4 embedded versions of a single Wikipedia article he 
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close together, then this article’s relationship to all of the other articles is preserved across 
modality. If any of the 4 embedded versions is incommensurate with the others then this 
would indicate either: 


i. The text features of the article differ signihcantly across language; i.e. the associated 

row of the embedding associated with A 12 , is far from the embedding 

associated with A 22 . While the French articles are not translations of their English 
counterparts (or vice versa), further understanding the textual feature highlighted 
by these incommensurabilities would be useful before pursuing further inference (e.g. 
topic modeling) in the joint embedding. 

ii. The hyperlink graph structure is not preserved across modality; i.e. the associated 
row of X^^\ the embedding associated with An, is far from x|^\ the embedding 
associated with A 21 . 

iii. The hyperlink structure and the textual similarities are incommensurate; i.e. the 
associated row of X^^\ the embedding associated with An, is far from X^^\ the em- 
bedding associated with A 12 , or the associated row of X) , the embedding associated 
with A 21 , is far from X^'^^ the embedding associated with A 22 . By studying these 
incommensurabilities further, we hope to better understand the data features that 
are emphasized by graph-based versus text-feature-based methodologies. 


To investigate further, we proceed by hierarchically clustering (using Ward’s method. 


see 


Johnson (1967) for detail) the 4 x 1382 points of the omnibus embedding and then 


compute the pairwise cophenetic distance (the height in the resulting dendrogram at which 
the two points are hrst clustered together) between each of the points. If the dissimilarities 
are well preserved across modality, then the maximum cophenetic distance between two 
embedded versions of the same article (we call this the Dendrogram Merge Height or DMH) 
should be small. 


In Figure 5(a) we plot a histogram of the DMH’s for the 1382 articles, and note that 


over 76% of the articles have DMH less than 100. In Figure 5(b) we see that over 63% of 
the articles have DMH less than 10. To further conhrm that the dissimilarities are well 
preserved across modality, we calculated cluster labels given by the hierarchical clustering 
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Figure 6: The adjusted Rand index between the clusters given by the hierarchical clustering 
dendrogram at height h G [0, 2] and the ground truth clustering (given by the 1382 size 4 
clusters each composed of a single article across modalities). 


dendrogram at height h G [0,2], We then compute the adjusted Rand index, ARI (see 


Hubert and Arabie, 1985), between these clusterings and the ground truth clustering (given 


by the 1382 size 4 clusters each composed of a single article across modalities), and plot 
this in Figure From the hgure, we see that the clustering is not only grossly clustering 
the article 4-tuples together, but is also capturing the hne-grain differences between the 
different articles as well. 

If the ARI between the hierarchical clustering and the ground truth clustering was 
equal to 1, then the structure of the four dissimilarities would be nearly identical, and 
joint inference across modality would yield minimal gain over separately embedding the 


Aj’s and then applying subsequent inference methodologies. From Figures 5(a) 5(b) and 
1^ we see this is not the case. Indeed, we see that the text-feature-based methods and 
graph-based methods are emphasizing some different data features both within and across 
language, and therefore for some articles the relative geometry in the four modality-specihc 
embeddings is not commensurate. We illustrate this in Figure where we plot a branch 
of the hierarchical clustering dendrogram when the tree is cut at height 20. Note that 
although the four modality-specihc embeddings of many articles (article 454 is highlighted 
here in blue as an example) are very similar, some of the articles’ embeddings are not 
preserved well across modality (article 366 is highlighted here in red as an example; note 
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Figure 7: A branch of the hierarchical clustering dendrogram when the tree is cut at height 
20. Note that the four modality-specihc embeddings of article 454 (highlighted in blue in 
the dendrogram) are very similar, while those of article 366 are not (the English graph with 
shortest path distance differs signihcantly from the other three modalities for this point). 


that the English graph with shortest path distance differs signihcantly from the other three 
modalities for this article). 


4.5.2 Zebrafish brains 


In Prevedel et ah (2014), the authors combined Light-Field Deconvolution Microscopy and 


pan-neuronal expression of GCaMP, a huorescent calcium indicator that serves as a proxy 
for neuronal activity, to produce a time series of whole-brain zebrahsh neuronal activity at 
near single neuron resolution. The data consists of 5000 realizations of a multivariate time 
series with G for all f, where for each i G [5379], G M represents 

the activity of neuron i at time t. Each time frame [t,t + 1) is 1/20 of a second; i.e. the 
data was collected at 20 Hz. After preprocessing the data and removing some artihcial 
edge neurons, we are left with g for each of f = 1, 2,..., 5000. 

Binning the time stamps into 100 overlapping periods of 5 seconds (so that for each 
T G [100], bin r consists of the matrix of observations 

T 


Z(r) _ j^{50(r-l)+l)| . . . |^(50(r+l)| _ (zj^^ | ■ ■ ■ KZfijjg)^ 


D 5105x100 


), 
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Figure 8: Heatmaps of the Frobenius norm differences between the 20 zebrahsh neuron 
embeddings obtained by fJOFC over a range of tn’s. Each heatmap is a 20 x 20 

grid, where the intensity of the i,j-th entry indicates the difference between the embeddings 
of the n* hsh neurons at times t = i and t = j] more red indicates less difference in the 
embedded space and white indicating very different embeddings. Note the anomalous point 
at r = 23. 


we compute a time series of 100 dissimilarity matrices as follows. For each r, we 

compute the thresholded correlation matrix G ]R5105 x5105 



l{|corr(zf\z5"^)| > 0.7} 


(where the threshold 0.7 was chosen to ensure sufficient sparsity in the resulting Zl^'^^’s). 
These correlation matrices are then transformed to dissimilarity matrices by 

defining 


A^^.^ = 1 

hj 




\Nr{z)UNr{i)y 

where NT-{i) is the neighborhood of neuron i in viewed as a graph. 


Initial change point detection analysis, analogous to that in Park et ah (2015), indicated 
that there was an anomaly in the neural correlations at time t* = 23 and identified n* = 469 
neurons responsible for this anomaly. To explore this further, we use fJOFC to embed a 
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Figure 9: Embeddings of the m = 20 elements of the time-series into obtained 

via fJOFC with w = 10. Each of the 20 plots is on the same set of axes. Note the anomaly 
at r = 23. 


portion of the time series (from times r = 11 to r = 30) obtaining the conhguration 


■ ■ ■ |(X(^°))^] 


T 


If there is an anomaly in the activity of the n* neurons at r* = 23, this should be evinced by 
X(23) signihcantly differing from X^’") for r 7 ^ 23, as seen in Figures and Moreover, the 
embedding can also inform the structure of the anomaly, as we can identify the change in 
structure within the n* = 469 neurons which is responsible for the anomaly in the embedded 


space; see Figure Below, we expound on the details of our embedding procedure and 
hndings. 

Restricting the full dissimilarities to the n* identihed anomalous neurons—yielding a 
times series of 100 dissimilarities in ^469 X 

—we hrst embed the m = 20 elements 
of into To test if there is an anomaly at r = 23, we next compute the 

Frobenius norm differences between the 20 embeddings in the conhguration. 

Results are summarized in Figure where we plot a heatmap of the Frobenius norm 
differences between the over a range of tn’s (plots of the 2-dimensional fJOFC 
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embeddings with w = 10 across r = 11,12,..., 30 are displayed in Figure]^. Each heatmap 
is a 20 X 20 grid, where the intensity of the i, j-th entry indicates the difference between the 
embeddings of the n* hsh neurons in and more red indicates less difference in the 
embedded space and white indicating very different embeddings. We see that, across the 
range of re’s, there is a signiheant anomaly in the embedding at r = 23. This both conhrms 
the initial Endings of an anomaly at r = 23 and demonstrates the potential robustness of 
this anomaly-detection procedure to misspecified w. We also note that the embeddings 
at times r = 11,12,13 are significantly different from the embeddings at all other times. 
Further analysis is needed to determine if this is neuroscientifically significant or a data 
collect/algorithmic artifact. We lastly note that this embedding ran in 1.5 hours using 
fJOFC run in serial and over 20 hours using JOFC, again showing the dramatic speedup 
of our fJOFC procedure. 

To further understand the structure of this anomaly, we plot the change in the embed¬ 
dings from times 21-22, times 22-23, times 23-24, and times 24-25 in Figure (so that 
there are 2n* points in each panel). In the figure, the neurons in the configuration at time 
23 are displayed as red points, with neurons in the configuration at other times displayed 
as black points. For each individual neuron, the movement in the configuration from times 
r to r -|- 1 are highlighted with blue lines; i.e., there is a line connecting the position of the 
neuron at time r to its position at time t + 1. From this figure, we can identify the groups of 
neurons whose change in activity is responsible for the anomaly. Again, further analysis is 
necessary to determine the potential neuroscientific significance of these neurons’ activity. 


5 Conclusion 

The JOFC algorithm has proven to be a valuable and adaptable tool for a variety of 


inference tasks (e.g., graph matching (Lyzinski et ah, 2013); hypothesis testing (Priebe 


et ah, 2013); joint classification (Sun and Priebe, 2013); among others). The key capability 


enabled by our fJOFC algorithm (both in-sample and out-of-sample) versus the JOFC 
algorithm is enhanced scalability in m and u; indeed, for a fixed n, we see a factor of 
m speedup over the JOFC algorithm, and for a fixed m we see a factor of n speed up 
achieved by fJOFC. Additionally, the out-of-sample fJOFC procedure is shown to have 
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Figure 10: Plot of the change in the embeddings from times 21-22, times 22-23, times 
23-24, and times 24-25 (so that there are 2n* points in each panel). In the hgure, the 
neurons in the conhguration at time 23 are displayed as red points, with neurons in the 
conhguration at other times displayed as black points. For each individual neuron, the 
movement in the conhguration from times r to r + 1 are highlighted with blue lines; i.e., 
there is a line connecting the position of the neuron at time r to its position at time r + 1. 


linear runtime in n. Combined with sparse dissimilarity representations of very large data 
sets, this capability to simultaneously embed many different large dissimilarities, both in 
and out-of-sample, enables the complex structure of the data to more easily be interrogated, 
leading to potentially signihcant discoveries heretofore beyond our grasp. 

While the sequential Guttman transforms computed in Algorithmare only guaranteed 
to converge to a stationary conhguration, because the sequence of raw stress values is 
decreasing, in practice they will typically converge to a local minimizer of cr(X). Note that, 
in most cases, the local convergence rate of the iterative Guttman transforms is linear, see 


de Leeuw (1988). In practice, the sequential Guttman transforms often exhibit good global 


properties, and only a few iterations are required to obtain a sufficiently good suboptimal 


embedding, see Kearsley et ah (1995). Analyzing these global properties and/or modifying 
fJOFG to accelerate the linear convergence—for example, by incorporating relaxed updates 
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in the iterative majorization as in De Leeuw and Heiser (1980)—are essential next steps 


for further scaling fJOFC to very big data. 


A Derivation of 

In this section, we collect supporting results to derive the desired form of Lh We hrst prove 
that can be realized via = (L + -^Jmr^ ^ — ^Jmn, 


Proposition 9. Let W be any symmetric weight matrix in 
torial Laplacian ofW, then L can be equivalently realized via 

t / 1 \ -1 1 

L I L -|- Jmn 1 Jmn- 

mn / mn 


mnxmn 


//L is the combina- 


(17) 


Proof. The proof is straightforward linear algebra, but we include it here for completeness. 
We hrst note that Jm^L = LJ^n = 0, so that 

L T Jmn 1 Jmn Jmn Jmn I L T Jmn 
mn / V mn 


We then calculate 


L H- J„ 

mn 


-1 


Jrr 

mn 


L — L ( L -|-- J^t 


L L 


mn 

1 

mn 


Jrr 


-1 


— T - — T 
L T dmn djyiji 

mn mn 


L I Imn Jmn 1 L, 

mn 


and 


/ 1 \ 

^ 1 


/ 1 \ 

^ 1 

1 L -f" Jmn 

^mn 
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1 L -f- Jmn 

^mn 
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mn 


\ mn / 

mn 


1 \ ^ 1 

L T Jmn 1 Jmn 
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d-mn 2 Jmn T Jmn 


mn 
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1 — J _]_ T _ T 
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mitian. It follows that Pl = (L + — Jmn) ^ - —Jmn as desired. 


+ mn mn 
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We have that the combinatorial Laplacian of W is given by L = W®/„ —diag(wj^i)® J„. 
It follows that 


L Jmn hV In ~\~ ( Jni diag(m 2 ji) J ^ Ju' 


mn 


mn 


Proposing that (L + — 




is of the form V ® In + Z ® Jn-, we arrive at the following. 


Theorem 10. With notation as above, let W be a weight matrix of the form of Eg. 0. 
and assume that Wi^i > 0 for all i G {1, 2,..., m}. Let L be the combinatorial Laplacian of 
W, then 


mn 


L ~|~ Jmn ) ( hV In ( Jm d%ag{Wi^jfj j Jn ] P In “1“ Z Jm 


mn 


-1 


where V = W ^ and Z = - {W + n {:^Jm - diagiwi^i))) ^ {:;^Jm - diagiwi^i)) W h 


Proof. First note that the assumption on assures that W is strictly diagonally 

dominant and is therefore invertible. If the proposed form, {L + ^Jmn)~^ = V®In+Z®Jn, 
is correct then 


Imn 1 Jmn 1 I L “1“ Jmn 

mn / V mn 


= ® In + “ diag(M;i,i)^ ® J^ (V ® 4 + Z 0 4,) 

= (>VV) ® In+ + n i — Jm - diag(M;i,i) ) ) Z + i — Jm - diag(M;i,i) ) V 

\\ \mn J J \mn 

From this, the desired forms of V and Z follow immediately. 


D Jn 
□ 


From Theorem 10 the following Corollary is immediate: 


Corollary 11. With notation as above, let W be a weight matrix of the form of Eg. 0. 
and assume that Wi^i > 0 for all i G {1, 2,..., m}. Let L be the combinatorial Laplacian of 
W, then 


V =W-^^In + 


[w + n 



diag{wi^i 


-1 



diag{wi 






mn 


® Jn- 
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